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Abstract 

Background: Label-free quantitative proteomics holds a great deal of promise for the future study of both 
medicine and biology. However, the data generated is extremely intricate in its correlation structure, and its proper 
analysis is complex. There are issues with missing identifications. There are high levels of correlation between many, 
but not all, of the peptides derived from the same protein. Additionally, there may be systematic shifts in the 
sensitivity of the machine between experiments or even through time within the duration of a single experiment. 

Results: We describe a hierarchical model for analyzing unbiased, label-free proteomics data which utilizes the 
covariance of peptide expression across samples as well as MS/MS-based identifications to group peptides — a 
strategy we call metaprotein expression modeling. Our metaprotein model acknowledges the possibility of 
misidentifications, post-translational modifications and systematic differences between samples due to changes in 
instrument sensitivity or differences in total protein concentration. In addition, our approach allows us to validate 
findings from unbiased, label-free proteomics experiments with further unbiased, label-free proteomics experiments. 
Finally, we demonstrate the clinical/translational utility of the model for building predictors capable of 
differentiating biological phenotypes as well as for validating those findings in the context of three novel cohorts of 
patients with Hepatitis C. 

Conclusions: Mass-spectrometry proteomics is quickly becoming a powerful tool for studying biological and 
translational questions. Making use of all of the information contained in a particular set of data will be critical to 
the success of those endeavors. Our proposed model represents an advance in the ability of statistical models of 
proteomic data to identify and utilize correlation between features. This allows validation of predictors without 
translation to targeted assays in addition to informing the choice of targets when it is appropriate to generate 
those assays. 
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Background (LC-MS) proteomic technology, which has become the 

Mass spectrometry proteomics state-of-the-art for differential expression proteomic 

The field of proteomics has made remarkable advances studies in most major laboratories around the world, 

in analytical hardware and software which have provided typically uses a 'bottom-up' approach, where the samples 

increasingly sensitive and robust analyses on platforms are subjected to a total proteolytic digestion, and the 

capable of detecting low abundance proteins from com- peptide surrogates' of the protein are quantified and 

plex mixtures, such as serum and cell lysates. The nano- identified using tandem mass spectrometry. There are 

scale liquid chromatography and mass spectrometry two general approaches used for bottom-up differential 

expression proteomics via LC-MS, those based on the 

use of stable isotope labeling or tagging of the peptides, 
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in recent years such that currently both relative and ab- 
solute quantitation of proteins is possible from complex 
mixtures by either labeled or label-free methodologies 
[5]. 

Although a specific advantage of the labeling 
approaches is the ability to heavily fractionate the sam- 
ples to "dig deeper" into the proteome while maintaining 
quantitative capabilities, extensive fractionation of the 
sample is often impractical in the context of a clinical 
study with tens or even hundreds of samples. The pro- 
teomics community has seen a significant increase in the 
use of the label free approach due to increased instru- 
ment stability and software sophistication, and it is 
emerging as the method of choice for larger clinically- 
based studies where use of the labeling strategies is im- 
possible or impractical In particular, an advantage of 
label-free strategies which measure area-under-the-curve 
(AUC) of the LC-MS peak is that any of a number of 
commercial or open-source software packages can be 
used to extract ion intensities from each individual ana- 
lysis, and statistical analysis on the relative abundance of 
these ions can be performed even in the absence of a 
peptide identification. The ability to precisely and repro- 
ducibly quantify thousands of proteolytic peptides using 
the label-free method was demonstrated by Wang, et al. 
and has been since employed and reproduced in a num- 
ber of laboratories [4,6] . 

Techniques for aggregating peptides into larger units 
generally revolve around protein identifications. A variety 
of approaches exist to combine individual peak areas to 
generate a relative or absolute aggregate expression levels. 
Once peptides are assigned to their parent proteins, using 
an algorithm such as ProteinProphet, either the peptide 
frequency of observation (Epectral counting) or MS inten- 
sity is used to estimate protein abundance [7]. The spec- 
tral counting approaches have gained a fairly large 
degree of use in the community due to their ease of 
implementation, however they generally suffer from a 
limited dynamic range and they are insensitive to 
small changes in expression level due to the large 
number of species which have peptides observed only 
1-3 times [5]. Label-free AUC approaches generally 
overcome these limitations by locating a peak in the 
retention-time and accurate-mass matrix using sophis- 
ticated software, and extracting the area under the 
LC-MS peak. An important characteristic of AUC 
label-free studies is that they need to be performed on 
high resolution instruments for the best results, which 
limits the application of this approach to more expen- 
sive QToF, FT-ICR, or Orbitrap instruments. 

Analysis of shotgun proteomics 

Error in protein-level quantitation can first occur due to 
incorrect peptide identifications. Even at a relatively low 



peptide false-discovery rate (i.e. 1%), the fraction of pro- 
teins detected that contain at least one false peptide 
identification is much higher because multiple peptides 
match back to the same protein. If a false-positive pep- 
tide is included in the protein level quantitation it can 
cause increased error in the protein-level quantitation. 
This can be partially overcome utilizing only the high- 
est-quality or best-ionizing peptides for protein quantita- 
tion, however in current implementations of "Top 3" 
quantitation, the individual peptide confidence is not 
utilized as an inclusion parameter [8,9]. A second type 
of error in protein quantitation occurs when many 
homologues share a common peptide. In this situation 
the protein grouping algorithm, such as ProteinProphet, 
makes an informed decision about which parent 
sequence a peptide belongs to and typically associates all 
of the peptide intensity to that parent sequence. This 
can deliver erroneous protein quantitation results when 
multiple homologues are present. A final challenge is 
with protein isoforms, post-translationally modified or 
proteolytically processed peptides, which may show a 
biologically relevant and different expression pattern 
than the proteotypic peptides. In these cases, they 
should not be grouped together with the other peptides 
for the purposes of modeling expression. 

This paper describes a statistical model which is 
designed to allow the inclusion and modeling of correl- 
ation structure for the problem of differential expression 
in mass spectrometry proteomics. There are a number 
of different approaches designed for protein level quanti- 
tation. The simplest of these use direct summarization 
of all features/isotope groups/peptides that are iden- 
tified for each protein, such as averaging or robust 
summarization based on quantiles [10], or averaging 
only the most abundant three peptides from a protein 
[8,9]. In addition to these algorithms, there are ANOVA 
approaches for protein quantitation [11] and differential 
expression [12,13]. These are regression models that 
variously include or exclude fixed effects for experimen- 
tal group and random effects to handle repeated mea- 
surements of the same sample (technical replicates). All 
of these approaches rely on protein identifications and 
none make explicit use of correlations between isotope 
groups. While we do not consider the introduction of 
fixed effects for biological phenotypes or the introduc- 
tion of random effects for cases in which we have repli- 
cate measurements from the same sample, the factor 
model we describe is a regression model. Therefore, one 
might introduce these features in a relatively straightfor- 
ward way. 

We present here a metaprotein classification approach 
which demonstrates the use of coexpression in addition 
to identification of peptides to assist in grouping with 
similarly-quantified peptides. We note here the specific 
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use of the term metaprotein. This is because, while many 
metaproteins obtained from fitting our model are domi- 
nated by peptides from a single protein, it is entirely 
possible that a "metaprotein" is not representative of a 
single protein at all Rather, a metaprotein may contain 
peptides from multiple proteins. In such a case, the 
metaprotein is representative of the activity of a path- 
way. The extent to which the model is able to distin- 
guish individual proteins within a pathway will depend 
on the sample size of the data set, however, even for very 
large sample sizes this may be impossible for proteins 
that have highly correlated expression. We would argue 
that distinguishing these is somewhatacademic in this 
case, and that noting that they are highly correlated, 
which is a feature of our approach, offers advantages in 
terms of higher power in subsequent hypothesis testing 
and model fitting (because the resulting metaproteins 
will be more independent than the results of protein 
level quantitation). Our approach has the following fea- 
tures, some of which are shared by the models described 
above: 

• Allows for the subtraction of large scale correlation 
structure between proteins that likely arise from 
technical rather than biological variability (batch 
effects). 

• Appropriately models both identified and 
unidentified features of the LC-MS output 

• Utilizes feature identifications from MS/MS spectra, 
but allows for the probability that some of those 
identifications will be incorrect 

• Produces a full posterior distribution on the model 
parameters, which leads to the quantification of 
uncertainty in the results. 

• Admits the possibility that sections of a protein will 
be post-translationally modified and therefore may 
not be representative of the expression pattern of 
the protein as a whole. 

• Makes use of correlation structure across samples, 
which provides significant information about feature 
relationships that is unused in many other 
approaches. 

• Can be used in the creation of predictive models 
based on multiple proteins, rather than just the 
enumeration of proteins associated with a particular 
outcome. 

We recognize that there are many excellent approaches 
to modeling label-free proteomic data that share some of 
these properties, however our proposed model is unique 
in its ability to concurrently model all of them. 

In addition to advances in statistical modeling of label- 
free, unbiased proteomics data, this paper presents a novel 
pre-clinical predictor of response to therapy in patients 



with Hepatitis C. Finally, we demonstrate the validation of 
that predictor in two separate cohorts. First, we show that 
the approach generates a predictor that is reproducible 
between two different labs that are utilizing entirely differ- 
ent mass spectrometry technologies. Second, we show that 
the predictor is able to accurately differentiate clinical 
responders from non-responders in a novel cohort of 
pediatric patients with Hepatitis C. In both cases, the pre- 
dictor is based on unbiased, label-free mass spectrometry 
data. We are aware of one previous example of the valid- 
ation of shotgun proteomics findings with further shotgun 
proteomics experiments [14], in which a number of indi- 
vidual peptides were validated as markers of central ner- 
vous system lymphoma. Standard practice is to validate 
using a targetted platform such as selected reaction moni- 
toring (SRM). To the best of our knowledge, ours is the 
first example of the validation of a full predictor. Finally, 
we verify some of the key elements of the predictor in our 
original cohort using SRM. 

Results and discussion 

Metaprotein factor model 

In order to estimate metaprotein abundance, we build our 
model from pre-processed data with intensity estimates 
aggregated to the isotope group level. In our modeling 
approach, we allow the possibility that an isotope group 
will be incorrectly identified, or be correctly identified, but 
have a pattern of expression that is distinct from the bulk 
of peptides from the corresponding protein. In practice, 
this new grouping approach often leads to metaproteins 
which may be dominated by isotope groups from a par- 
ticular protein, but which contain isotope groups from 
other proteins as well. 

Let X be a P x Af-dimensional matrix consisting of 
measurements on P isotope groups across N samples. 
We utilize a modification of the latent factor model out- 
lined previously in [15-18]. 

X = ftl N +AA+e (1) 

The P-dimensional vector p has elements fa represent- 
ing the mean expression of isotope group i and 1 N is a 
column vector of ones. The N x AT-dimensional matrix A 
represents latent factors which will be learned from the 
data and A is a P x /C-dimensional matrix of factor load- 
ings with elements a i)k . The random variable e is a P x N 
matrix of idiosyncratic noise. 

Our goal is to estimate relative protein concentration 
from this model using the latent factors in A. Recall that 
we have identifications for some subset of the isotope 
groups. With this in mind, suppose we identify each col- 
umn of A and the corresponding column of A with one 
identified protein. If we set a i>k = 1 when isotope group i 
is from a peptide identified as coming from protein k 
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and <2;,£ = 0 otherwise, then our model is describing the 
expression pattern of each isotope group as a noisy 
approximation of the expression pattern of the protein, 
where the protein is known. 

Retaining, for the time being, the idea of fixing in 
this way, we wish to handle the possibility of changing 
sensitivity and changing protein concentration from 
sample to sample. To account for this, we introduce an 
additional set of latent factors into equation 1. 

X = {tl n +BH' + AA +e (2) 

We now introduce latent factors H and factor loadings 
B = (pi^j) where / = 1 • • •/ which we use to account for 
systematic structure in the data that is sample specific. 
Because these features will span almost all peptides, we 
utilize a generic Gaussian prior for the elements of B. 

bij :N(m 0 ,v 0 ) 

This distribution represents our belief that these 
effects span all isotope groups, but with varying effect 
sizes. This prior also minimizes identifiability issues be- 
tween B } which is not sparse, and A which is very sparse 
with informative priors. 

Rather than assigning isotope groups to factors based 
purely on identification, we want to utilize a prior on A 
that allows for possible post-translational modifications 
and misidentifications. With this in mind, we want to 
relax our strict assignment of zeros and ones in the 
loadings matrix A. Instead, our prior distribution for 
will reflect our level of certainty that we know which 
factor should represent the expression of this peptide. 
When we have an identification for peptide i and have 
mapped that peptide to protein /<, our prior distribution 
will reflect an increased certainty that a^^O. 

We introduce a ^-dimensional vector of latent vari- 
ables fa) which identifies the non-zero column of A for 
each isotope group. When we have an identification that 
suggests that isotope group i comes from protein /<, our 
prior distribution for Zi is 

Zi : Multinomial (1, qi) 
q t : Dir(a 0 , a 0 , a*, a 0 , •••a 0 ) 

where a/ c is substantially larger than ao to reflect our 
prior belief that Zi = k. We default to a/ c = 500 ao, but 
have tried values from 100 through 1000 and these lead 
to only minor shifts in metaprotein membership. As the 
weight of this prior decreases we are decreasing the im- 
portance of identification information and placing pro- 
gressively more importance on correlation structure. We 
find, for the Hepatitis application below, that the associ- 
ation of metaproteins with outcome doesn't substantially 
change until we increase the weight of the identification 
data to very high levels. We find that using ao = 1 leads 



to interpretable metaproteins without loss of association 
with the outcomes. For peptides which do not have 
identifications, we utilize an unbiased prior Zi : £>zr(ao). 
Because different peptides showing similar expression 
patterns may, nonetheless, show a different magnitude 
of expression of that pattern due to the relative sensitiv- 
ity of the mass spectrometer for the peptide, we model 
each of the non-zero elements of A independently, such 
that dijc ' N(m a , v a ) when z\ = k and = 0 otherwise. 

There is not a specific threshold for determining the 
grouping of isotope groups into metaproteins. Instead, 
the assignment of an isotope group to a particular meta- 
protein is a function of the variance associated with that 
isotope group, the number of isotope groups already 
assigned to each metaprotein and the level of agreement 
between the expression pattern of the isotope group and 
that of the metaprotein. All of these things are estimated 
using MCMC within the context of the model fitting. 

We note that, in the limit as a/ c — >oo we obtain an 
ANOVA model in which there is no uncertainty about 
which metaprotein each isotope group belongs to. This 
is a fixed effects model with feature -specific variance 
similar to Clough et al. [11]. That limiting model implies 
that identifications are assumed to be accurate and 
assumes that post-translational modifications are of 
minor importance. Clough et al. correctly point out that, 
by collecting features one obtains higher power for 
detecting associations versus simple tests of association 
with individual isotope groups. That model, as well as 
the one we present here, may be expanded with add- 
itional design vectors identifying experimental groups or 
particular interventions if so desired. These may be 
included as columns in H which are simply not updated 
in the MCMC (Markov chain Monte Carlo). 

To complete the model specification, we assume a 
conjugate, row specific inverse gamma prior for the 
variance of e. This is similar to the protein level aggre- 
gation model of [11], and allows differing variance 
estimates for each isotope group. Because we are 
working with a relatively large number of samples 
(and thereby have no issues with identifying variance), 
we use a prior with mean 1 and variance 100. We also 
assume that the individual columns of A arise from a 
uniform distribution on the Af-dimensional sphere of 
radius y/N. The model is fit via MCMC and the result 
of this fit is a set of draws from the posterior distribu- 
tion of all of the model parameters. All prior distribu- 
tions are conjugate, and therefore we may use Gibbs 
sampling to update the model parameters at each step 
of the MCMC. The data sets we are modeling have 
been aggregated at the isotope group level, and as 
such they have between 20 and 40 thousand measure- 
ments per sample. While our sampling scheme is able 
to fit this data in just a few hours on a desktop, we 
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expect that some sort of parallel processing will be de- 
sirable for data that is aggregated at the feature level 
We have tested our model on multiple simulated data 
sets of various sizes (both sample size and number of 
isotope groups) to verify the accuracy of the parameter 
recovery even in the presence of intentionally mis- 
labeled isotope groups. 

Overlapping peaks 

By our assumption that each row of A have only one non- 
zero entry, we have restricted our peptides to belong to 
just one metaprotein. As an alternative, one might allow 
more than one non-zero element in each row of A. This is 
equivalent to assuming that more than one metaprotein is 
responsible for the expression pattern seen in a single pep- 
tide. This might occur in cases where multiple isotope 
groups have highly overlapping peaks or where multiple 
proteins have homologous regions that can give rise to the 
same peptide. Although the extent to which we see mul- 
tiple isotope groups in a single peak is unclear, this type of 
structure can be accounted for with relaxed priors on A. If 
a^k is an element of A, then a point mass mixture prior 
accomplishes our goals. 

• (l-^OSo + qkN(a iik \mo, v 0 ) (3) 

where 5 0 is the distribution describing a point mass at 0. 
This distribution represents the prior belief that some, but 
not all, metaproteins will be required to describe the ex- 
pression pattern of each isotope group. The normal distri- 
bution allows the magnitude of the effect to vary. For each 
of the metaproteins, we estimate the number of associated 
isotope groups by our prior distribution on the mixing 
probability q k : 

q k :Be(v 0l Yo) 

This approach allows for the restriction on isotope group 
association with just one metaprotein to be relaxed. How- 
ever, as the resolution of mass spectrometry increases, and 
as fractionation in multiple dimensions (such as 2D chro- 
matography and ionmobility) makes the distinction be- 
tween polypeptides clearer, this modification to the model 
will become less and less important. Further, experience 
suggests that the vast majority of measured peaks are sin- 
gle species. Because of this, the addition of features to deal 
with overlapping peaks can introduce more noise than it 
removes, particularly when the number of samples in the 
experiment (and therefore the amount of information 
available from the correlation structure) is limited. 

Features of the factor model 

One of the strengths of our approach is the ability to 
collect isotope groups based not only on identifications, 
but also on their coexpression across samples. Perhaps 



the most common method for visualizing correlation 
structure in high-dimensional data is hierarchical clus- 
tering. However, this is most typically used as a 
visualization strategy and does not, by default, provide 
quantitative estimates of aggregate behavior. While it is 
possible to generate models based on what is visualized 
from hierarchical clustering, nearest centroids for ex- 
ample, these have not to our knowledge been published 
for proteomic data. In addition, there are questions sur- 
rounding how one might combine peptide identifications 
and correlation structure in a principaled way to jointly 
model all of the available information. 

One can identify collections of coexpressed isotope 
groups as well as an approximation of the expression pat- 
terns of each group from our model based on posterior 
distributions of the model parameters. The posterior para- 
meters of greatest interest will depend on the specific ap- 
plication, but often we will be most interested in the 
vector of factor memberships, z, which describes which 
peptides group together most often. In data sets intended 
for the generation of predictive models in clinical/transla- 
tional studies (as well as other types of studies), we will be 
interested in A. The columns of this matrix define our 
estimates of the expression patterns of the metaproteins 
across our samples. These can be used to estimate fold 
change, or can be treated as independent variables in any 
type of model that is appropriate for the study. 

Associated with meta-protein i is a column of factor 
scores, A i} representing the expression of that meta- 
protein. In addition, there is a collection of isotope groups 
which make up that meta-protein, the isotope groups /for 
which Zj = i. Figure 1 shows a heatmap of all of the pep- 
tides from the dataset that are identified as belonging to 
the protein Apo E. Note that, while the majority of those 
peptides share a common expression pattern, three .(la- 
beled 45, 31 and 53)show a very different, conflicting pat- 
tern. Our meta-protein model automatically groups the 
co-expressing peptides into the same factor while assign- 
ing the peptides with conflicting patterns to other meta- 
proteins that more closely match their expressionpatterns. 

Inconsistent expression of peptides from the same protein 

We define a "dominant metaprotein" for a protein to be 
the metaprotein(s) with more than half of its identified 
isotope groups contributed by that protein. We note 
that, because there are many metaproteins it is possible 
for a dominant metaprotein to consist largely (or even 
entirely) of peptides from one protein, but not contain 
all (or even a majority) of the isotope groups from that 
protein. One of the features we have observed from 
studying posterior parameters from our model is that 
there are manyexamples in which an identified isotope 
group (one with a peptide and protein label) does not 
follow the expression pattern of its corresponding 
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Samples Samples Samples 

Figure 1 Peptides from Apo E. A heatmap of all of the isotope groups in the data set that are identified as originating from the protein 
Apoliprotein E. The numbers on the y-axis indicates which metaprotein that peptide was assigned to. The peptides are ordered from top to 
bottom from highest to lowest mean intensity across the samples. Note that, while the majority of those peptides share a common expression 
pattern (those assigned to metaprotein 16), three show a very different, conflicting pattern. Our meta-protein model automatically groups the co- 
expressing peptides into the same factor while assigning the peptides with conflicting patterns to meta-proteins that more closely match their 
expression patterns. Each row in each heatmap has been standardized to have mean zero and standard deviation one. Red is a relatively high 

level of expression and blue a low level of expression. 
^ J 



dominant metaprotein. That is to say, for any given pro- 
tein there is often a "consensus" expression pattern that 
many of the isotope groups from that protein follow, but 
that there is also a large minority of isotope groups which 
do not follow that expression pattern. We fit our model to 
all 109 proteins which have more than 1 identified peptide 
in the data set. Heatmaps similar to Figure 1 for each of 
these proteins are available in Additional file 1. Examin- 
ation of these figures shows that the presence of peptides 
that show expression patterns significantly different from 
their corresponding dominant metaprotein is the rule, 
rather than the exception. 

There are a few reasonable explanations for this. The 
most obvious possible explanation is that the poorly con- 
forming peptides are those with the lowest overall inten- 
sity, and therefore subject to smaller signal to noise ratios. 
However, examination of heatmaps showing the exact 
same peptides, but now sorted by mean intensity across 



the samples rather than by meta-protein membership 
demonstrates that there is not a strong predominance of 
low intensity peptides among those that do not coexpress 
with the other peptides from the protein. All 109 of those 
heatmaps are available in the Additional file 2. We tested, 
using a non-parametric Kruskal-Wallis test, for association 
between meta-protein membership and mean signal inten- 
sity for each ofthe metaproteins, and found that, of the 
109 proteins tested, only 2 showed significant association 
(p-value < 0.01, APOB and CERU). 

Another possible explanation for the presence of poorly 
co-expressing peptides within a single protein is misalign- 
ment between runblocks. The data set was analyzed in 
three runblocks, one of which occurred months after the 
original two, and aligned according to the algorithm 
described in the methods section. There are sometimes 
shifts in retention time between runblocks which may lead 
to misalignment, however, we expect misalignment to be a 
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rare occurrence. Additionally, if peptides are misaligned, 
we would expect to see a peptide that coexpresses with its 
dominant metaprotein well in two run blocks but is mis- 
matched in the third, and this is not generally the case. 
We are able to find some examples that fit this pattern, 
however, it is almost always the case that when a peptide 
does not share an expression pattern with its dominant 
metaprotein one runblock, it also does not share that pat- 
tern in the other run blocks. 

A third explanation for peptides that are uncorrelated 
is mis-identification. However, we are using identifica- 
tion algorithms with parameter settings that lead to very 
low (approximately 1%) false identifications. We expect 
around 34 such misidentifications, assuming that identi- 
fication is correct 99% of the time (based on the 3398 
total peptides with identifications). In fact, examining 
the list of peptides that do not belong to their dominant 
metaprotein, we see that more than half fall into this cat- 
egory (1640 out of 3398). This is true despite our prior 
distribution assigning a 500x greater likelihood of a pep- 
tide belonging to its dominant meta-protein than to a 
different metaprotein. 

Post-translational modification of proteins is a well 
known process, but it is unclear how extensive these 
modifications are. If proteins are extensively and dynam- 
ically modified after translation, then we should expect 
many of them to exhibit expression patterns that do not 
match the bulk of peptides from that same protein. Also, 
if peptide modification is a significant contributor to 
observed patterns of expression, then we also expect to 
find peptides that have targets for post-translational 
modification to be more likely to be found outside their 
dominant meta-protein. We examined the probability of 
peptides containing Glutamine and Asparagine, which 
are known sites of deamidation, to belong to their dom- 
inant metaproteins. Correcting for the number of pep- 
tides inside and outside their dominant metaproteins, 
we find that peptides containing Glutamine are approxi- 
mately 1.2 times more likely to not follow the dominant 
expression pattern for any given protein, and that this is 
a statistically significant difference (p-value .0013, fisher s 
exact test). In addition, peptides with Asparagine are 
1.22 times more likely to fail to coexpress with the dom- 
inant group of peptides from a protein (p-value 0.0010). 
In addition to these two sites ofpost-translational modifi- 
cation, we examined the motif NxS/T, which is a known 
site of N-linked glycosylation. For these two motifs, 25 
of the 30 peptides which contain the "NxT" motif and 
53 of the 60 peptides which contain the "NxS" motif fol- 
low expression patterns that are different from their 
dominant metaproteins (odds ratios 4.3 and 6.6 respect- 
ively, p-values 0.0013 and 2.1e-8 respectively). Addition- 
ally, both Serine and Threonine are known to be sites of 
O-linked glycosylationas well as phosphorylation. We 



find that both Threonine and Serine are also more likely 
to show odd expression patterns (Threonine: odds ratio 
1.2, p-value .002 and Serine odds ratio 1.3, p-value 8.1e-6). 
We tested Proline (odds ratio = 1, p = .96) and Histidine 
(odds ratio = l.l,p = AO) as negative controls. 

We note that we are not directly detecting the post- 
translationally modified peptides. Instead, we suspect that 
we are detecting changes in the expression levels of the un- 
modified peptides due to post translational modification. 
Thus, peptides with any of these post-translational modifi- 
cation motifs are significantly less likely to follow the dom- 
inant expression pattern for the protein from which they 
are derived. It is possible that post-translational modifica- 
tion is a pervasive feature of plasma proteomics, and that 
protein level quantitation is likely to either introduce errors 
by summing across uncorrelated parts of a protein 
(if quantitation is accomplished through summation across 
all associated peptides) or to miss critical post-translational 
modifications (if quantitation is accomplished by summa- 
tion of only the top three peptides). It is important to note, 
however, that inconsistent expression of peptides from the 
same protein is likely due to multiple different causes, 
including protein isoforms, increased noise in low abun- 
dance peptides, misalignment, mis -identification, peptide 
modification, and sample to sample variation in ion sup- 
pression from variable coelution in the LC separation. 
Regardless of the root cause, the statistical model we 
propose will identify isotope groups that show this charac- 
teristic and treat them appropriately. 

Analysis of spike-in data: Relabeling of misidentified 
peptides 

We obtained a publically available spike-in data set from 
[19] which consists of three replicate measurements of 
each of six conditions. The base solution is solid-phase 
N-glycocapture from human serum, and each of six differ- 
ent non-human proteins were spiked in at each of six dif- 
ferent concentrations in a latin square design. After fitting 
our model to this data, we find that there is strong evi- 
dence for a significant increase in the total number of iso- 
tope groups that could be identified as coming from the 
known six spiked in proteins (Additional file 3). Estimated 
fold-change for each protein was also calculated for each 
experimental group and each spiked in protein. We find 
that, while the estimation works well for the highest four 
groups for each spiked in protein, the lowest two groups 
are typically underestimated due to the significant level of 
missing data. 

In order to test the ability of our model to identify and 
correct inaccurate identifications, we randomly permuted 
10% of the identifications in this data and refit our model. 
This experiment was repeated 100 times. Additional file 3 
shows the results of this experiment for the 'MYG_- 
HORSF metaprotein for one such permutation. Note that 
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there are 3 isotope groups that are correctly assigned to 
the MYG_HORSE metaprotein even though the original 
identification would have placed them in other proteins. 
This experiment was repeated 50 times, and of the 295 
total reassigned identifications across those experiments, 
223 were correctly reassigned to their original protein. Of 
the remainder, the peptide was fit to a noise factor 61 
times and fit to some other protein 11 times. None were 
assigned to the protein associated with the incorrect 
relabel. 

Comparison with protein level quantitation 

While there are similarities between our meta-protein 
model and various techniques for protein level quantita- 
tion, the ability to group peptides based on co-expression 
across all samples, and therefore identify peptides that 
show evidence of post-translational modification is a crit- 
ical difference that is shared by no protein level quantita- 
tion method that we are aware of. Nonetheless, it is 
interesting to compare our model to protein level quan- 
titation algorithms. For this purpose, we will examine 
two such algorithms. The summation algorithm esti- 
mates protein level quantitation by summing total 
expression across all peptides from a protein. This algo- 
rithm automatically gives peptides with high intensity 
measurements a larger effect on the estimated protein 
level quantitation. A second algorithm, Top 3, estimates 
the protein level expression as the mean of the three 
peptides with the highest intensity (average across the 
samples). These two algorithms typically give similar 
results, however, because peptides from the same pro- 
tein do not always show consistent expression patterns, 
these approaches can lead to protein level quantitation 
that is unnecessarily noisy. 

In general, we find that the correlation between our 
metaprotein model and protein level quantitation for 
estimation is high when there are a large number of pep- 
tides from the given protein. For example, one of the most 
abundant proteins in this data set is Apolipoprotein B 
(represented by 409 isotope groups), and correlation 
between estimated expression from our factor model and 
from summation of all Apo B identified peptides is 0.97. 
However, when there are fewer peptides available or if 
there are many misidentifications or modifications, we 
find evidence that our factor model gives improved esti- 
mation of protein level expression patterns. For example, 
pregnancy zone protein, which has only three associated 
isotope groups in the data set, is known to be over 
expressed in women compared to men. While both the 
metaprotein model and the summation algorithm show 
differential expression for this protein, the factor model 
gives a j?-value of 4.5e-4 (statistically significant even after 
Bonferroni correction for multiple hypotheses) as com- 
pared to a ^-value of .0014 for estimation by summation 



over identified peptides (not significant after multiple hy- 
pothesis correction). 

Of the 96 subjects in this study, we have available anti- 
body assay mesurements of both Apo B and Apo E on 
38. We compared the two protein level quantitation 
algorithms and our metaprotein model to the antibody 
assay "gold standard". Correlationsare all generally high 
and examination of Figure 2 shows that the three techni- 
ques are generally in agreement, even on outliers. The 
top three isotope groups identified as coming from Apo 
B show a high level of correlation, and all of these pep- 
tides are members of the main Apo B metaprotein. Also, 
a large majority of the Apo B peptides show this same 
expression pattern and are assigned to the same meta- 
protein, thus agreement between the three methods on 
Apo B is not surprising. 

However, examination of the top three isotope groups 
from Apo E (Figure 1) shows a different picture. The 
second most abundant isotope group from Apo E is in a 
different metaprotein because it shows a substantially 
different expression pattern from the bulk of the Apo E 
isotope groups. In addition, if we delete the two outliers 
from the data (they are outliers by all three quantitation 
methods), the correlations between the three top Apo E 
isotope groups and the antibody assay of Apo E activity 
are.59, .23 and .56 respectively (j?-values of .0002, .17 
and .0004 respectively). Thus, this second most abun- 
dant isotope group should be adding noise to the Top 3 
protein level estimate of Apo E. Interestingly, the correl- 
ation between the Top 3 estimate and the antibody assay 
is .60. This is higher than any of the three separately, 
which suggests that the antibody assay is in fact measur- 
ing an aggregation of two different forms of Apo E. 

Metaprotein expression in a hepatitis C cohort 

In addition to the analysis of a publically available latin 
square data set (Additional file 3), we obtained pre- 
treatment serum samples from 96 patients with Hepatitis 
C who have a known response or non-response to the 
standard of care treatment with interferon and Ribavirin 
[20]. Serum from the patients was measured with open 
platform LC-MS/MS. The overall goal of the study was to 
predict who among the study subjects will respond to 
therapy and who will not. We are also interested in 
estimating which proteins and peptides are potential mar- 
kers of response, allowing for future, targeted assay 
development. 

Analysis of this data set proceeds in two steps. First, 
the model described above is fit to the proteomic data 
without regard to the phenotype of interest. There were 
a total of 6,729 peptides in the data set with either posi- 
tive identifications or with average expression levels 
greater than the mean. Of these 3390 had identifications. 
These were matched with 265 different proteins of 
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protein level expression. We note that the three algorithms are in agreement to a high degree. Correlations for Apo B are .55, .54 and .56 for 
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Antibody assay values, which measure protein per volume, were converted to protein per total protein using total protein levels from a Bradford 
assay. All measurements were standardized to mean zero and standard deviation 1. 
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which 109 had two or more associated, identified 
peptides. 

Prediction of outcome 

As with the computation of protein level expression, our 
109 metaproteins may be used in any context as inde- 
pendent predictor variables. Additionally, we may assess 
the level of association between metaproteins and other 
biological phenotypes in either case. 

Our first step in the analysis of the posterior distri- 
bution involves comparing the mean metaprotein 
expression patterns for all metaproteins to the "re- 
sponse to therapy" phenotype. We find three such 
metaproteins to be significantly associated (ANOVA 
^-value <9.8e - 5) even after correction for multiple 
hypothesis testing (ZA2G, ITIH and HRG). Protein 
level quantitation by summation yields only 2 and pro- 
tein quantitation by top 3 yields zero. Furthermore, 
the relevant expression pattern is far clearer for the 
metaprotein analysis. Figure 3 shows the most predict- 
ive "protein" (ZA2G), as computed by the summation 
protein quantitation algorithm. Examination of the 
metaprotein with the most peptides from ZA2G, built 
from our model, shows that ZA2G is indeed identified 
as predictive by the metaprotein model as well. How- 
ever, our metaprotein model identifies only those pep- 
tides from ZA2G that are highly correlated with each 
other, and it additionally identifies a number of other 
peptides from other proteins that share the same 
expression pattern across the samples (also shown in 



Figure 3). The result is better separation between 
responders and non-responders. 

In addition to strong associations for three metapro- 
teins, we find that there are a total of 13 metaproteins 
with ^-values less than .01 (random association would 
dictate only 1). Thus there is clear evidence of the pres- 
ence of blood-borne markers of response to therapy in 
Hepatitis C. 

Identification of candidate peptides 

We would like to identify a set of candidate peptides for 
use in future targeted studies such as selected reaction 
monitoring (SRM) or antibody studies. From the analysis 
of associations between averaged metaproteins and the 
phenotype, we are confident that there are markers of 
interest. In order to identify the most relevant isotope 
groups, we propose to obtain draws from the MCMC 
chain, and for each draw build a predictor and observe 
which peptides are included in that predictor. In this 
way, the values of A are computed directly from the 
peptides included in the corresponding metaprotein. 
Additionally, by keeping track of which peptides are 
most often included in the predictors, we obtain a list of 
candidate biomarkers for future study. 

Because we have 109 metaproteins but only 87 sam- 
ples, direct regression is not possible in this context. We 
instead use variable selection with model averaging. 
Variable selection allows regression with a small subset 
of the total number of predictors, while model averaging 
allows us to properly account for uncertainty in which 
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Figure 3 Comparison to protein summarization. Comparison of metaprotein and summation approaches to aggregation of large numbers of 
isotope groups. Panels a (runblock 0) and b (runblock 1) show all of the isotope groups identified as coming from the protein ZA2G. This is one 
of the two proteins significantly associated with the response to therapy phenotype in patients with Hepatitis C. Samples are on the x-axis and 
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models are correct. In this context, model averaging has 
been shown to outperform the single best model for pre- 
dictive accuracy on hold out data sets [21]. We use a 
publicly available implementation of variable selection 
and model averaging called Shotgun Stochastic Search 
[22]. 



As mentioned above, our analysis consists of two 
steps. The first is the generation of a factor model to 
explain the variation seen in the peptide concentration 
data. This step is unsupervised; it does not take into 
account any phenotype data in any way. In this step we 
are seeking only to describe correlation structure in the 



Lucas et al. BMC Bioinformatics 2012, 13:74 
http://www.biomedcentral.eom/1 471 -21 05/1 3/74 



Page 11 of 18 




False Positive 

Figure 4 Simulation ROC curves. Receiver operating curves 
generated from 200 draws from the MCMC chain (red) compared to 
ROC's generated from predictors of randomly generated outcomes 
(blue). The random predictors were generated from a mean zero 
multivariate normal distribution using a covariance matrix generated 
from the factor matrix A for the corresponding MCMC draw. The 
outcome variable (response/non-response) was then permuted 
before the 'random predictor' model was fit. 



in order to measure mass-to-charge ratio. Differences in 
these machines and differences in protocols between 
laboratories can make reproducing the same results at 
different labs difficult— even when the samples used are 
exactly the same. Second, validation of findings in new 
samples can be difficult, particularly when the validation 
approach involves the use of different sample prepar- 
ation or different techniques for measuring the peptides 
or proteins. It is quite common to see two phase experi- 
mental designs in which candidate peptides/proteins are 
identified by shotgun mass spectrometry then those can- 
didates are validated on a much larger group of samples, 
but using antibody assays or SRM mass spectrometry. 
These approaches introduce additional variables in the 
sample preparation or measurement, and in cases where 
validation fails the explanation for that failure becomes 
uncertain— it may have failed because the peptides/pro- 
teins are not good candidates or because of the changes 
in the way they were measured. 

We validated our results on the predictor of response to 
therapy by both testing the consistency of the predictor 
when measured in different labs and also when testing the 
accuracy of the predictor in a new set of samples. In each 
case, we are attempting to validate our predictor through 
the use of an additional unbiased, label-free data set. 



isotope groups. In principal, if there were another large 
open platform plasma proteomics data set available this 
step could be performed on a different data set entirely. 
Prediction of the phenotype from the metaprotein 
expression through the use of variable selection and 
model averaging in a binary regression model constitutes 
the second, supervised step. Results from this analysis 
will vary slightly at each step of the MCMC chain. This 
allows us to estimate both the accuracy of predictors 
generated by this model as well as the uncertainty in 
that accuracy. Figure 4 shows receiver operating curves 
(ROC) of the model for 200 steps of the MCMC, and 
compares this to the same ROCs from predicting ran- 
domly generated phenotypes with the same number of 
cases and controls. We see that the accuracy is signifi- 
cantly better than chance for all 200 draws from the 
Markov chain. The collection of most used isotope 
groups in this modeling approach are shown in Figure 5. 
We show the locations of the associated peptides (those 
with identifications) in their respective proteins. 

Validation of a predictor discovered in an unbiased, label- 
free data set 

There are two significant challenges facing users of 
label-free, unbiased, mass spectrometry proteomics. 
First, there are a number of different mass spectrometry 
machines, and some utilize different physical principals 
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Figure 5 Predictor content. Shows the locations of the 600 most 
used peptides which were also identified. Proteins with fewer than 
1% of their peptides represented in the figure were filtered out. 
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Figure 6 Validation. Validation of a predictor of response to 
therapy. A metaprotein model was fit to the training data (outlined). 
This model was then used to predict response to therapy in a 
subset of the training samples that was measured in a different lab 
using a different machine and different protocols. In addition, it was 
used to predict response to therapy in a new set of pediatric 
patients. In both cases, the validation was successful. For the Peds C 
cohort, the Area under the ROC curve is .8 and the p-value is 
4.4 x10" 4 . 



Because of difficulties with alignment between data sets, 
this approach is challenging and is usually not taken. The 
approach to this type of validation undertaken successfully 
by [14] involved the validation of multiple individual 
makers of disease. That approach produces a set of puta- 
tive biomarkers, some of which validate in the follow-up 
data and some of which do not. It leads to a more highly 
filtered, and therefore more likely to succeed, list of bio- 
markers, but it does not lead to a validated predictor. 
Validation of a such a predictor must take into account 
uncertainties associated with alignment and false discov- 
ery. The advantage of our approach for this purpose is in 
the use of factors as predictors rather than individual 
isotope groups. By aggregating multiple isotope groups 
and using the aggregated expression pattern, the 



metaprotein model produces predictions that are robust 
to misalignments or false discovery associated with indi- 
vidual analytes. 

We reanalyzed 28 of our original samples in an inde- 
pendent laboratory on a different mass spectrometry 
machine using a different technology (orbitrap versus 
time of flight). Those samples were analyzed using the 
protocols of the independent lab without modification. 
In addition to this, 51 additional samples were obtained 
from a pediatric cohort [23], and the predictor was eval- 
uated on this cohort. All data sets were aligned to each 
other (as described in methods), and a predictor was 
built using our original data (as described above). This 
predictor was then tested for accuracy in each of the 
two additional data sets. Figure 6 shows that in both the 
independent laboratory measurement and the independ- 
ent validation were successful. The area under the 
receiver operating characteristic (ROC) curve for the 
new, untrained pediatric patients was .8 and the pre- 
dictor was significantly different between patients with a 
sustained viral response compared to those who did not 
respond to therapy (p-value 4.4 x 10" 4 by £-test). These 
two results represent a validation of a clinically relevant 
predictor of disease state from label-free, unbiased mass 
spectrometry by further label-free, unbiased mass 
spectrometry. 

Verification of differential expression for individual peptides 

In order to verify our findings, we identified 87 peptides 
which were subsequently targeted for quantitation with 
selected reaction monitoring (SRM) without the addition 
of stable-labeled peptides, similar to the LF SRM method 
described in [24]. Of the samples that were measured 
with the shotgun proteomics approach, 25 from the first 
two run blocks (55 samples) were measured as well as 
38 from the third run block (41 samples). In order to 
test whether we can generate predictors of response to 
therapy in patients with Hepatitis C, we trained a regres- 
sion model on the 25 and validated on the remaining 38. 
All samples were used in the selection of peptides to tar- 
get with SRM, therefore this does not constitute a true 
validation of an SRM based predictor. However Figure 7 
shows that our predictive accuracy on the held out sam- 
ples is quite good. This verifies, on the same samples 
and for a subset of the peptides, that our initial findings 
in the open platform can be reproduced by measure- 
ments of individual peptides. Because we utilized the 
shotgun proteomics results from all samples to select 
isotope groups for SRM, this does not address the ques- 
tion of out of sample accuracy, however, we have 
addressed this issue separately in Patel et al. [20]. 

In addition to biomarker verification, we sought to val- 
idate some examples of isotope groups which were 
determined not to follow their dominant metaprotein 
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Figure 7 Verification of predictor. Predictive accuracy of the SRM assay for predicting response to therapy in Hepatitis C. Training was on the 
initial 25 samples, and validation was on the followup 38. Panel a shows behavior of the model on the trained samples and panel b shows 
behavior on the hold-out samples. The ROC curve in panel c shows accuracy only on the hold-out samples. 
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Figure 8 Verification of poor coherence. Peptide that doesn't follow the pattern of it's dominant metaprotein. Panel a shows the correlation 
between the measured intensity of the polypeptide ^PP^ATPIR from the label-free, unbiased platform compared to the label-free SRM platform. 
The high level of correlation demonstrates the reproducibility of measurement of this peptide. Panel b shows the pattern of expression, as 
measured in the unbiased platform, as compared to the pattern of expression of the FINC metaprotein. Panel c shows a scatterplot comparing 
the pattern of expression of the FINC metaprotein and the intensity of the peptide as measured with the shotgun approach and panel d shows 
the same comparison, but now with the SRM approach instead of the shotgun approach. The peptideis measured consistently and it does not 
correlate with the expression pattern of the FINC metaprotein. 
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during the analysis of the original shotgun proteomics 
assay. An example is shown in Figure 8. The polypeptide 
"TTPPTTATPIR" from the protein FINC, which is 
known to be potentially O-glycoslyated at two residues 
[25], has a consistently-measured expression pattern on 
both SRM and QToF platforms (Figure 8a). This peptide 
exemplifies the type of peptide the Metaprotein 
approach would not group into the dominant metapro- 
tein for FINC, since it does not correlate with the 
expression pattern of other FINC peptides (Figure 8b) or 
the FINC metaprotein (Figure 8c and 8d). Intensity data 
from the SRM experiment is available in Additional file 4. 

Conclusions 

To date, the algorithms and models for aggregating mass 
spectrometry features in larger groups than peptides 
have relied on protein identifications and do not use cor- 
relations across samples in any way. The model we have 
described makes use of these correlations to identify 
peptides that do indeed show consistent expression in 
addition to agreement in identification. This paper 
describes both a novel statistical approach for the ana- 
lysis of label-free, unbiased mass spectrometry proteo- 
mics and an approach for the validation of discoveries 
on that platform by further experiments on the same 
platform. Furthermore, we have demonstrated the appli- 
cation of this approach in the context of the treatment 
of Hepatitis C, which is a disease causing significant 
morbidity and mortality worldwide. The predictor of 
response to therapy described in this paper has the poten- 
tial to significantly affect therapeutic decision making. 

The model offers an approach to aggregation of mass 
spectrometry proteomics data that differs from protein 
level quantitation, and that in some situations, offers sig- 
nificant advantages over previous approaches. The crit- 
ical difference between protein level quantitation and 
our metaprotein model is the inclusion of correlation 
structure in the metaprotein model. We should note that 
the inclusion of correlation structure may offer only 
minimal advantage low sample size situations. Addition- 
ally, in cases where there is an overwhelming biological 
effect, that effect can dominate the observed correlations 
and drown out what would otherwise be differing 
expression patterns. However, we have demonstrated the 



Table 1 Demographics of the training data 





Responders 


Non-responders 


Gender 


13 female, 24 male 


4 female, 14 male 


Race 


50 Caucasian, 1 1 AA 


23 Caucasian, 8 AA 


Log-viral load 


14 + 3.7 


1 5.6 + 1 .6 


Metavir 


1 .6 + 1 .7 


1.7 ±2.3 



utility of the approach for validation studies and for 
potential clinical applications. 

Our model is particularly well suited to situations in 
which we have a relatively large number of samples 
(>30) with a high degree of biological variability. We 
have shown that it can be used to gain insights into 
translational studies, and have exemplified its use with a 
study of response to therapy in patients with Hepatitis 
C. The model is appropriate for any high-quality quanti- 
tative (area-under-the-curve) proteomics data. This is 
not limited to data-independent acquisitions, rather the 
approach can be utilized on any label-free quantitative 
data which is based on precursor intensity for the quan- 
titation and collects enough points across the peak to 
accurately define the peak area. 

Methods 

Sample selection 

Chronic hepatitis C (HCV) patients, n = 96, were selected 
for proteomic analysis from the Duke Hepatology Data- 
base and Biorepository, as detailed previously [20]. Sam- 
ples analyzed were all pre-treatment serum aliquots, but 
patients were classified based on their sustained response 
to Pegylated Interferon/Ribavirin combination therapy, the 
standard of care for HCV. Patients were matched as well 
as possible on the basis of relevant clinical parameters, in- 
cluding gender, viral load, metavir fibrosis score, and race 
(Table 1). Patients were divided between Genotype 1 HCV 
non-responders (n = 42), Genotype 1 HCV responders 
(n = 34), and Genotype 2/3 HCV responders (n = 20). A 
subset of this cohort, including 25 of the original 55 and 
38 of the subsequent 41 samples, were also utilized for 
label-free SRM analysis. 

Sample preparation, instrument operation, and data 
preparation 

The plasma sample preparation by immunodepletion 
and trypsin digestion, as well as the unbiased LC-MSE 
data collection on a nano Acquity and QToF Premier 
mass spectrometer (Waters Corporation), has been 
described in detail previously. [20]. Unbiased proteomic 
data analysis using label-free area-under-the-curve quan- 
titation was performed in Rosetta Elucidator, utilizing 
both Mascot (Matrix Sciences, Inc) and PLGS v2.4 
(Waters Corporation) and exported for statistical ana- 
lysis, also as previously described [20]. Peptide annota- 
tion for this dataset was performed at a 1 peptide FDR 
using decoy database validation. To enable external ana- 
lysis of the quantitative data using alternative methods, 
we have made the two unbiased quantitative data sets 
available (see Additional files 4 and 5). LC-SRM analyses 
were performed using a subset of the same samples, 
which were stored at -80 C between the initial unbiased 
analyses and the targeted analysis. A scheduled SRM 
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method was generated in Skyline v0.7 (https://brendanx- 
uwl.gs.washington.edu/labkey/project/home/software/Sky 
line/begin.view) for 82 of the most promising biomarker 
peptides, and 5 peptides from yeast alcohol dehydrogenase 
(ADH1_YEAST) were also included as internal standards. 
The method included up to 3 transitions per precursor 
ion, and the MS method details as well as quantitative data 
have been included in Additional file 4. Analyses were 
performed on a nanoAcquity UPLC system coupled to 
a Xevo TQ mass spectrometer (Waters Corporation), 
using a method identical to that described for the 
unbiased data, with the following exceptions. The gra- 
dient length was 30 min, the column utilized was a 
75 \im x 150 mm BEH, the flow rate was 0.4 (iL/min 
and column temperature was 35°C. Raw data was 
imported into Skyline, quantified using a label-free 
approach, and exported for statistical analysis. 

Preparation of data for analysis 

To accomplish data alignment and feature quantitation 
across all biological samples and thus form the matrix 
discussed in the statistical methods section below, we 
utilized Rosetta Elucidator™v3.3 software package 
(Rosetta Biosoftware) to import and align all MSE and 
data-dependent acquisition (DDA) raw data files [26-30]. 
Database searches were performed against a forward/ 
reverse Swissprot database (v 56.5) with human tax- 
onomy, using ProteinLynx Global Server v2.4 (IdentityE 
algorithm, Waters Corporation) for MSE searches or 
Mascot v2.2 for DDA data. Database searches are either 
performed externally and results imported (PLGS 2.4) or 
queued directly from within Elucidator (Mascot) to 
allow identification of many of the quantified features in 
the proteomic dataset. All database searches were per- 
formed with high mass accuracy on precursor and prod- 
uct ions (typically 20 ppm precursor and 0.04 Da 
product ion tolerance), with fixed carbamidomethylation 
(Cys), variable oxidation(Met) and variable deamidation 
(Asn and Gin). Annotation of the peptides is accom- 
plished at an estimated 1% FDR using the Elucidator im- 
plementation of PeptideProphet algorithm [31]. Visual 
scripting within Elucidator is utilized to extract feature 
intensities for those features which have quantitative 
values above the 1000 counts (approximately 10th per- 
centile) in 50% of the samples. The final file for statis- 
tical analysis is made up of a matrix of intensities, with 
the rows corresponding to isotope groups and the 
columns to technical observations (LC-MS analysis). An 
isotope group is defined as all of the peaks associated 
with a single peptide at a specific charge state and reten- 
tion time. This level of quantitation combines peaks 
from the same peptide that differ according to the num- 
ber of carbon 13s incorporated, but does not combine 
the same peptide measured at different charge states. 



The intensity of an isotope group for a given sample is 
the total volume under the feature peaks associated with 
that isotope group. This is monotonically related to the 
concentration of that isotope group in the original sam- 
ple, and it is these intensities that we work with. We 
have made the matrices of intensity values associated 
with the study available in Additional file 5. 

Technical variation 

This data set was collected in three run blocks. Two of 
these were consecutive and the third was run months 
later. There are significant batch effects present in com- 
paring the first two to the third even after correcting for 
observed total protein. Even though we did not include 
explicit design vectors for batch in our regression, our 
inclusion of a factor matrix describing systematic effects 
(/? and H) allows this source of technical variation to be 
automatically modeled without foreknowledge. We find 
that the first row of H perfectly distinguishes runblocks 
1 and 2 from runblock 3, with values between .95 and 
1.05 for the former and between - .95 and -1.05 for the 
latter. This ability to bridge between two different data 
sets supports our claim that this model is able to iden- 
tify and subtract out some variation in sensitivity due to 
batch effects. 

Alignment of data sets 

The data from a single experiment consists of a list of 
features along with their associated mass-to-charge 
ratios and retention times. Because there is some level 
of randomness to all of these measurements, there is 
some uncertainty in the identification of which feature 
from one experiment should be associated with a 
given feature in another experiment. The process of 
matching features across experiments is termed data 
alignment. For matching features within a single 
experiment, we utilize Rosetta Elucidator™, which is a 
commercial package for the processing and analysis of 
proteomics data. However, there were sufficient differ- 
ences between the different experiments (which were 
run months apart, at different labs and on different 
machines) to make the Rosetta algorithm inadequate 
for the task of alignment across datasets. For data 
alignment across the different batches, we utilized the 
following construction. 

Let i be an index over the set of all peptides measurable 
in our experiment. Further, define to be 1 if the ith pep- 
tide was measured in the experiment. Let x t be a vector 
containing the "true" retention time and mass-to-charge 
ratio associated with the /th peptide. Then, if = 1, we 
assume our measured values, x* are normally distributed 
around X[ with some shift and scale along with an 
unknown co variance, x* : N(\i + S#;, . There is a 
small subset of isotope groups that have been identified in 



Lucas et al. BMC Bioinformatics 2012, 13:74 
http://www.biomedcentral.eom/1 471 -21 05/1 3/74 



Page 16 of 18 



all data sets. We initialize all of our parameters to maximize 
the likelihood of this small subset, then use a greedy 
algorithm to select matches for the remainder of the iso- 
tope groups. The algorithm stops assigning matches based 
on the prior probability that = 1 (we have used .5). 

There is substantial information available that is not 
being used in this algorithm. First, it would be possible to 
assign prior distributions to the model parameters and it- 
eratively fit this model. This would lead to better estimates 
of model parameters allong with full posterior distribu- 
tions. However, because the distribution is extremely 
spiky, the estimation of uncertainty from this algorithm is 
somewhat uninteresting and uninformative. Second, there 
is information available in the high energy mass spectrom- 
etry trace even when that trace is insufficient to fully iden- 
tify the peptide which one might include in our model as 
additional dimensions to %{. Third, we are not making use 
of the intensity of the measured isotope groups across the 
samples. This allows for the possibility that there may be 
drastic changes in peptide concentrations between the two 
experiments. Even so, it is likely possible to obtain and use 
reasonable and informative distributions on these inten- 
sities for the purposes of alignment. However, because all 
of our results are based on factors, which are the aggregate 
expression of multiple isotope groups, a low but non-zero 
level of inaccurate alignments may lead to a mild increase 
in noise, but not a drastic change in our overall results. 

Full model specification 

The list of variables we will use is as follows: 

• P is the number of isotope groups measured 

• AT is the number of samples in the study 

• X is a P x N dimensional array of intensity values 
with elements Xq 

• e is a P x N matrix of normally distributed noise 
(residuals). We assume that each isotope group, i, 
has its own noise variance, Tj 1 

• pt is a P-dimensional vector representing mean 
intensity for each isotope group. The elements of 
this vector are fa 

• B is a P x D matrix of loadings and H is aAfxD 
matrix of factors. These are intended to describe 
patterns of expression that span most or all isotope 
groups (as reflected by the priors). The elements of 
extitB are and the elements of H are h n ^> 

• A is a P x K matrix of loadings and A is a N x K 
matrix of factors. K is the number of metaproteins. 
These factors are distinguished from the previous 
ones by their prior structure. Specifically, A will 
contain only P non-zero elements, and the non-zero 
elements for any particular factor should consist of a 
large number of isotope groups from the same 
protein. This contrasts with B which contains no 



non-zero elements and describes expression patterns 
across the entire set of peptides. The elements of A 
and A are fak and respectively. 

• z is a P-dimensional vector (with elements Zj) 
containing, for each isotope group, the index of the 
metaprotein to which that isotope group has been 
assigned. 

• u 0 , 0o> v o> v o> §o> &o and are constants set at the 
time of model fitting. 

The full hierarchical model specification is as follows. 

X = ftl N +BH' +AA' +e 
ft : N{fa, 0 O ) 
b i4 :N(0,v 0 ) 

Ti : Gamma(yo, 5 0 ) 

&i,d\Zi*d = 0 

&i,d\Zi = d : AT(0, v 0 ) 

Zi : Multinomial (1, qi) 

qi : Dir(a 0 , a 0 , a*, a 0 , •••a 0 ) 

All of the constants are set in order to minimize the in- 
fluence of the prior distribution. Specifically, we use Uo = 8 
and 0 O = 100 corresponding to a 95% confidence interval 
on the log- transformed intensities between -12 and 28 
(significantly overdispersed relative to the empirical distri- 
bution). Similarly, we set v 0 = 100 which allows potential 
fold changes > 10 8 . We set v 0 and 5 0 both to .001 corre- 
sponding to a mean estimated residual of 1, but with a 
variance of 1000. We note that estimation of variance can 
be difficult, however, we have >100 samples in our study. 
For smaller studies, one might use more informative priors 
for variance. Both a 0 and are pseudocounts for our 
Dirichlet distribution. We use a 0 = .01 and aj< = 500. We 
note that this last parameter is the only one in the model 
that is potentially strongly informative, so we have tried 
values from 1 to 1000. We considered the percentage of 
peptides that are members of their dominant metaprotein. 
We find that, in our study population the affect of varying 
this parameter is to move that percentage between 43% 
and 53%. The relatively minor effect of this large change in 
pseudocounts is probably due, again, to our large sample 
size. This leads to very high data likelihoods for member- 
ship in particular metaproteins for almost all isotope 
groups. This parameter will have a larger effect in smaller 
studies. 

Additional files 



Additional file 1: Protein level grouping. Heatmaps for all identified 
isotope groups. The rows are sorted according to metaprotein 
membership and the columns are sorted so that the first principal 
component is increasing. Each isotope group is associated with a single 
metaprotein. Those metaproteins are labeled on the y-axis. 
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Additional file 2: Protein level, sorted by expression. Heatmaps for 
all identified isotope groups. The rows are sorted according from highest 
to lowest expression level and the columns are sorted so that the first 
principal component is increasing. Each isotope group is associated with 
a single metaprotein. Those metaproteins are labeled on the y-axis. 

Additional file 3; Analysis of spike-in data. Estimates of the protein 
levels of spiked in samples from the Super Hirn experiment. Also 
included are the results of one of the label substitution experiments. 

Additional file 4: Expression from SRM. Raw expression levels from 
the label free SRM proteomics assay for a subset of the original HCV 55 
cohort. 

Additional file 5: Shotgun proteomics data and Matlab code for 
analysis. Raw expression levels from the various shotgun proteomics 
experiments and Matlab code for performing the analysis described in 
this paper. 
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